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ABSTRACT 


This work presents a systematic method for transforming an 
algorithm represented by mathematical expressions into a 
systolic array architecture Systolic arrays are highly 
structured architectures tailored to a specific application 
They have specific architectural properties such as simple 
processing elements < cells) simple and regular data and control communication and 
local cell interconnections 

The method consists of three major steps algorithm 

representation algorithm model and architecture specification 
The algorithm representation involves the translation of the 
algorithm into a set of locally recursive equations In 
algorithm model step a Dependence Graph (DG) is obtained from 
these recursive equations From this model the computations 
are first scheduled and then grouped among a set of cells such 
that the systolic array characterstics are preserved This 
grouping of computations is done via geometric projections The 
valid projection directions, referred to as projection vectors are 
systematically determined from the DG In architecture 

specification step processor basis A corresponding to each 
valid projection vector is determined Then a geometric 
transformation matrix T «= ^t is formed The design information 

such as number of cells, operations performed in each cell and 
data timings are extracted from the transformed index set Cell 
interconnections and data movement are extracted from the 



transformed dependencies 


The method produces all possible ( par t i t i onabl e/ 
non-par t 1 t 1 onabl e ) systolic solutions for the algorithm under 
consideration and is the simplest and computationally less miens u^e of 
all the existing methods 



transformed dependencies 


The method produces all possible (parti tionable/ 
non-par t 1 t 1 onabl e > systolic solutions for the algorithm under 
consideration and is the simplest and computationally less intensive of 
all the existing methods 
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CHAPTER 1 


INTRODUCTION 

Will wou walk a little 
faster'? said a whitting 
to a snail 

The revolution in VLSI technology allows the implementation of 
powerful computational systems on a VLSI chip using multiple 
and regularly connected processing elements These VLSI systems 
that combine pipelining and multiprocessing have been referred 
to as systolic arrays til Many designs using the systolic 
array concept exist for computational algorithms in signal and 
image processing and matrix manipulation [1-51 

The systematic transformation of an algorithm into a systolic 
array is a design automation problem of current interest The 
development of a systematic methodology for transforming 
algorithms into systolic arrays has been addressed IZ 5-151 
The power as well as the limitations of these methodologies 
have been recorded [191 In many of the existing methodologies 
the final systolic array is the result of several ad hoc 
modifioatio ns to the original design The others use linear 

transf ormati ons and often choose the transf ormations in an ad 

( 

hoc manner 
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In general < the systemaiio construction of systolic arrays can 
be grouped into three major steps [19] 

1 Algorithm Representation An algorithm expressed by some 
high-level construct e g mathematical expressions is 
represented in a suitable format such as recursive eciuations 

2 Algorithm Model The algorithm representation from step 1 
IS modeled for systematic systolic design 

3 Architectural Specification Prom the algorithm model the 
systolic array specifications such as the number of cells 
design dimension, cell interconnections operations performed 
in each cell data movement and data timing are extracted 

The above three steps may or may not all be contained in a 
given methodology Furthermore each major step may consist of 
several sub-steps 

This research describes a simple methodology for systematic 
construction of systolic arrays and includes all three major 
steps For step 1 the algorithm is represented by recursive 
equations In step 2 these recursive equations are represented 
by a graph which portrays the dependences among the 

computations This graph is known as Dependence Graph CDG] 
From this model, various computations are first scheduled and 
then grouped among a set of cells (processing elements) such 
that the systolic array characterstics which includes the use 
of simple cells, simple and regular data and control 
communications and local cell interconnections are preserved 





This grouping of computations is done via geometric 
projections Not all geometric projections will produce 
systolic arrays The valid projection directions referred to 
as projection vectors are systematically derived from this 
DG In step 3 processor basis A corresponding to each valid 
projection vector is obtained Then a geometric transformation 
matrix T»>^t is formed The design information such as number 
of cells, operations performed in each cell and data timing are 
extracted by applying this transformation matrix to the index 
set of computations Cell interconnections and data movement 
are extracted from the transformed dependence vector matrices 


In general algorithms suitable for systolic implementation are 
highly structured It is assumed that the number of different 
systolic solutions for any given problem is problem size 
independent Therefore, this systematic systolic array design 
approach is first applied to the smaller sized problems The 
design information such as the index transformation matrices 
and the cell interconnections are produced The design 
information is then easily extrapolated to larger problem sizes 
to produce corresponding larger systolic arrays 


1 1 PROBLEM STATEMENT 

Many approaches exist for constructing systolic arrays from 
algorithms or from initial non-systolic designs For these 
methods in general, a systolic solution is the result of both 




systamatio and ad hoc procadures Tharc are approaches which 
provide systematic methodology for constructing systolic arrays 
19 121 However these approaches may not satisfy the simple 
cell property of systolic arrays The existing methods do not 
directly incorporate the properties of systolic arrays in the 
design process It seems appropriate that knowledge of legal 
communication topologies (such as local connect i vi ty ) 
modularity (such as regular communication) and operation 
complexities (such as simple arithmetic operations) should be 
used in designing such systems with such properties A good 
approach for constructing systolic arrays should use certain 
algorithm characterstics to control cell complexity data 
communication and cell i nterconnecti vi ty Faroughis approach 
(13-151 IS directed towards incorporating the array 

characterstics in the design process But this method often 
overlooks the pipelining mechanism of systolic arrays while 
decomposing the recurrence equation of the algorithm into 
simple computations (to determine the algorithmic feature 
interrelationships) Further all these methods are 

computationally intensive 

For a given algorithm there may exist many different systolic 
solutions, each with different characteristic parameters such 
as the number of cells, throughput total execution time, cell 
interconnection pattern and data movement All the systolic 
solutions for a given algorithm need to be explored in order to 
select the best solution for the specific application This 
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work presents e simplified approach for oonstruoling systolic 
arrays that incorporates the properties of systolic arrays in 
the design process and produces all the existing solutions for 
a given algorithm description 

1 2 ORGANIZATION OF THE TECHNICAL REPORT 

Chapter 2 presents the background to this research In 
particular Section 2 1 describes the systolic array 

architecture and its advantages In Section 2 2 the literature 
on the systematic construction of systolic arrays is reviewed 

Chapter 3 presents the simple methodology developed for 

constructing systolic arrays In particular, Section 3 i 

presents the algorithm representation Section 3 2 presents 
the algorithm model which involves the construction of DG and 
obtaining valid schedule and projection vectors from this DG 
Section 3 3 presents the architecture specification 

In Chapter 4, the architectures obtained for various algorithms 
with the aid of ADSA (Automatic Design of Systolic Arrays! 
package are presented These algorithms include Upper 
triangular linear systems of eauations UX-B Matrix-Matrix 
multiplication Discrete Fourier Transform LU decomposition 
FIR filter, recursive convolution and HR filter 



CH^PTER 2 


BACKGPOUND 


When we mean to build 
we first survey the plot 
then draw the model And 
when we see the figure of 
the house then must we 
rate the cost of erection 
WILLIAM SHAKESPEARE 


The properties and advantages of systolic 
in Section 2 1 Section 2 2 presents a 
existing methods for constructing systolic 
an algorithm description or from an 
design 


arrays are presented 
brief review of the 
arrays either from 
initial non-systolic 


2 1 SYSTOLIC ARRAYS AND THEIR PROPERTIES 


Special purpose computer systems are typically used to meet 
specific performance requirements or to off-load compulations 
that are too time-oonsuming for a general purpose computer 
Since these systems have limited applicability their 
cost-ef f eoti veness has always been the deciding factor in 
guiding their development With evolving VLSI circuit 
technology, the design cost of a special-purpose system can be 
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held down only if approprmte Architectures ere used Swstolic 
arrays have been characterized as one such architecture 

Systolic array architecture consists of 

1 simple processing elements (cells) 

2 simple and regular control and data communications 

3 local cell interconnections 

A extensive pipelining and multiprocessing (i 183 


These 

cells 

in general 

perform 

simple operations 

involving a 

smal 1 

number 

of operands 

hence 

providing simple 

communication 

among 

the 

neighboring 

cells 

The complete 

array is 


constructed by repeating the cells over the chip area Regular 
communication implies that the design can be made modular 
Larger systems can be designed by combining the proven smaller 
ones The local cell interconnections provide simple and 
regular wire routing, making the design of these special 
purpose devices cost effective The use of pipelining and 
multiprocessing enable one to reach the performance requirement 
of a special-purpose system In a systolic design data flows 
in a pipelined fashion between the cells and communication 
with the outside world is performed only at the boundary cells 
All or portions of the boundary cells have input/output ports 

Kung has defined systolic designs as semi -systol io if global 
data communication is used and pure-systolic if no global 
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communication is used [13 In global communication data is 
broadcasted to or collected from some or all of the cells With 
no global communication data is passed from cell to cell and 
each datum is used only in one cell at any given time When the 
number of cells increases using global communication results 
in long connection wires expanding these non-local 
communication paths becomes difficult without slowing down the 
system clock In general pure-systolic is preferred but the 
correct choice of semi or pure-systolic design depends on 
the chip I/O limitations the specific application and the 
environment in which the chip is used The cells in a systolic 
design operate synchronously The need to distribute the clock 
signal to every cell in the design without clock skew is one 
problem of systolic architecture 

Another architecture similar to the systolic architecture is 
the Wavefront Array Processor [16 173 Unlike the systolic 
array a wavefront array does not operate synchronously Here 
the operations of each cell are controlled locally and depend 
on the availability of the input data and on the delivery of 
the previous output data to the neighboring cells Due to this 
local control of operations skewing of the input data 
generally needed in systolic designs may be avoided in 
wavefront designs 

Although a systolic array is usually a special purpose device 
customized to a specific algorithm they can be designed with 


programmable cell functions and programmable cell 
interconnection patterns to be used for a small subset of 
different algorithms 1183 In this case different functions 
may be computed in the same array by programming 
(reconfiguring) the array The number of different systolic 
arrays that can be programmed into a reconf i gurabl e array is 
limited by the number of different cell functions and the 
number of different cell interconnections Reconf i gurabl e 
systolic arrays are usually pure-systolic 

In general a systolic architecture is used to implement 
a regular and compute-bound problem A problem is compute-bound 
if the total number of operations is larger than the total 
number of inputs and outputs A problem is regular when 

repetitive computations are performed on a large set of data 
111 Since the introduction of the systolic architectures 
numerous systolic designs for various known problems have 
been proposed 11-53 Automatic construction of systolic designs 
has only recently been addressed IZ 5-153 The quality of a 
systolic design is evaluated by its correctness delay data 
flow timing, number of cells (chip area), modularity and many 
other factors such as accuracy, flexibility, fault tolerance 
and cost 

2 2 AUTOMATIC SYSTOLIC DESIGN 


In this section existing methodologies for designing systolic 


arraws are reviewed These methodologies are grouped under 
seperaie titles based on related techniques 

221 2 -OPERATOR APPROACH 


The Z-operator which is also referred to as the delay operator 
IS defined as 


or 




1 + J 


where the indices indicate displacement in time or space Mhen 
the Z-operator is applied to a mathematical expression with 
indexed variables the index differences are indicated by the 
powers of Z For example applying the Z-operator notation to 
the Finite Impulse Response (FIR> filter defined by 


Wj “5^ Sjlf *1-3 
i 

yields 

«i '=^•3*2:* Xj 
] 

The new expression is then manipulated symbolically using the 
mathematical properties of the Z-operalor Implementations 
with different properties can be derived 15] By direct 
hardware interpretation of the expression correctness of the 
implementation is assured However in general > the direct 
hardware interpretation does not have the important 
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oharacterslics of the syalolic architecture Further ad hoc 
modifications to the designs are needed 

2 2 2 INDEX TRANSFORMAT 1 0h« 

Cappello and Steiglitz use linear transformations on recurrence 
equations [51 Starting with recurrence equations describing 
the algorithm a canonical representation is obtained by 
adjoining a time index to the definition of the recurrence For 
exampl e 

Wi } t ^1 j t-1 + Wj I X^_j 

IS the canonical representation of the FIR filter The 

coordinates of the left side of the canonical form indicate the 

location where the computation on the right takes place This 

gives a geometric interpretation to the recurrence relations A 

geometric representation together with the ordering rule, 

indicates what data moves, where it moves to and when it moves 

With various linear transformations of the geometric 

representation the original computational locations change 

thus producing a different geometric shape Then, by 

factoring out the time dimension (geometric projection) 

different designs are produced The transformations are 

selected in an ad hoc manner and in [53 it was also shown that 

2 

certain transformations will lead to AP optimal design where A 
IS the area and P represents the period 


Another method that uses the index transf ormation is reported 
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in 110-125 Th« organization and operations of a systolic 
design are described by a set of 3-tuples (G F T> G 
represents the network geometry the position of each cell 
identified by its Cartesian coordinates The function F 
associated with each cell represents the operations performed 
by that cell T represents the network timing and it indicates 
for each cell the time when the operations F occur and when the 
data communication takes place 


Given an algorithm in the form of a high-level language, the 
loop bodies are ordered in a lexicographical ordering to 
eliminate the broadcasts One or more data-dependent vectors 
correspond to each variable Let x(Ij|^) and x(l 2 ) be two 
variables generated at a statement S using the indices and 

I 2 Variable x(] 2 ) is said to be dependent on a variable 

x(l^> if < ^2 ' lexicographical sense and x(lj^) is 

an input variable in the statement S Then the vector da=l 2 - Ii 
is called a data-dependent vector for variable x 


Let dj^,d 2 , dfn be the set of data-dependent vectors 

calculated for the algorithm Let D» [ dj^ d 2 d^] be the 

data-dependent matrix of the algorithm A linear transf ormation 
T» ^ when applied to D, will generate the data-dependent 
matrix of the new structure that is TD * D’ Matrix D <= 
[d^ »tJ 2 '» dm 3 represents the modified data dependences in 

the new index space and is favoured by VLSI Here XT 1 * related 
to time and S is related to the geometric properties of the 


alfiiorithm such as cell interconnections 


The transformation 


matrices are determined systematically from a set of 
predetermined cell interconnections which is represented by P 
Possible S matrices are computed according to the diaphantine 
equation S*D»P*K where matrix K indicates the utilization of 
primitive interconnections in matrix P 

This approach does not satisfy a property of systolic arrays 
namely simple cell structures The operations required in the 
body of the inner loop determine the cell complexity Those 
algorithms with complex loop bodies will produce complex cell 
structures as the data dependence vectors which are used as a 
set of algorithm charactersti cs are not exploited to obtain an 
information about the cell complexity Furthermore this method 
IS computationally more intensive To obtain all possible S s 
the above equation needs to be solved as many times as there 
are K matrices and this number is usually very large For 
example, for matrix multiplication algorithm described by A=B*C 

where A B and C are square matrices of size (3X3) with 

„ To 0 0 X X 

Pss Q i-10 0-11 1-1 there exists 729 possible K matrices of 

size (9x3) 

Li and Wah s approach is to model an algorithm represented as 
recurrence processes into a non-linear integer programming 
problem C6] Starting with an algorithm described as a set of 
linear recurrence equations, this method derives three classes 
of parameters, velocities of data flows spatial distributions 



of data and periods of data The relationships among these 
parameters are represented as constraint vector equations which 
must be satisfied in order to obtain a correct design The 
performance of a design can also be expressed in terms of the 
defined parameters Performance can be defined as execution 
time or the product of the square of execution time and the 
number of processors Optimal designs are then searched in the 
space of solutions satisfwing the constraint equations This 
search is done by ordered enumeration over a limited search 
space in time polynomial to the problem size From the 
definition of the recurrence equations the functional 
description of the cells is derived From the parameters 
mentioned above, the interconnections among cells are found 

In Faroughi s method [13-151, the algorithm represented by 
mathematical expressions, is first decomposed into a set of 
equivalent simple computations which are grouped into subsets 
based on the required set of operations and same type operands 
Then features, which include the operands of each simple 
computations the lexicographical index differences of each 
pair of operands of a simple computation and the 

lexicographical index differences of the respective operands of 
each pair of simple computations, arc extracted from the simple 
computations The properties of systolic arrays are represented 
in terms of feature interrelationships A set of information 
referred to as valid projection vectors is systematical 1 y 
determined from the feature i nterrelationshi ps 
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But this method often overlooks the pipelining mechanism of 
systolic arrays while decomposing the recurrence equation of 
the algorithm into simple computations And this is the reason 
why this method has failed to obtain a systolic architecture 
for the solution of lower triangular linear systems of 
equations Lx*b Furthermore the set of feature 
interrelationships to find the valid projection vectors is 
usually very large even for a small sized problem and many of 
the feature interrelationships are trivial 

Miranker and Winkler s method [81 is similar to that described 
in [10-4.21 Given an algorithm in a high-level language with 
nested do-loops, loop index expansions is performed in an ad 
hoc fashion to eliminate the broadcasts The normal order of 
the operations corresponds to a set of indices F which is 
referred to as the nodes of the computations F is represented 
by a graph structure For example 


B(2 4)«B(1,4) 


and 


C<2,4)-C(2.3)+A(2,4 )fB(2,4 > 

indicate that these statements require computations at nodes 
(1 4), (2,3) and (2 4) These nodes arc called the domain of 
dependence for node (2,4) In the structure representing F 
dependence vectors (edges) are formed based at node (2,4) which 
point back to the nodes in its domain of dependence For this 
example the dependence vectors are (-1,0), (0 -1) and (0,0) and 


lb 


M 

are denoted by F 


If 

For systolic dssign T is homogeneous i e the dependence 


status 

at one node is the 

same as that 

at 

any other 

node 

Furthermore 

two structures 

are isomorphic 

as 

graphs 

if 

there 

IS a 1- 

-to-1 

correspondence 

f between them 

such that 

If 

(p-q) 


IS a dependence vector then f<p)-f<q) must also be a dependence 
vector where p and q are nodes Different structures result by 
mapping p-q onto f(p)-f(q) These mapping are affine mappings 
and they are given by the linear maps An mXn integer matrix A 
IS selected as a linear mapping Let d^ d 2 dm be the 

dependence vectors of a structure then Ad^ » d^ for 1*1 m 
gives the dependence vectors of the new structure Conditions 
for the existence of A are given in t83 The first element in 
every d^ corresponds to time and the remaining elements 
indicate the space For spatial representation of the systolic 
design, a universal planar systolic array <in 2-D) is 
constructed where each connection corresponds to a dependence 
vector in space A universal planar systolic array contains all 
of the possible dependence vectors The last n-1 elements of 
every d^ represent the dependence vector in space and 
correspond to a portion of the universal systolic array 

223 DEPENDENCE GRAPH TRANSFORMATION 

Quinton s method [91 uses quasi-affine mappings Given a system 
of n uniform recurrence equations defined over some convex 
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subset D in Z <D cen be thought of as the index set of 
recurrence) and with some characteristic dependence vectors 
(which together define a dependence graph) this method starts 
by finding a timing function that maps points of D into time 
This requires identification of a convex space of feasible 
timing functions from which one can be chosen heuristical ly 
Such space can be found systematically from the knowledge of 
dependence vectors and D Next an allocation function is 
chosen and it projects 0 into space along a preselected 
direction which is chosen so that two points in D with the same 
image under the timing function do not map into the same point 
in space Once the timing and allocation functions are known 
systolic array can be systematically constructed from D The 
procedure maps each point of D into a cell which computes the 
recurrence function and receives (sends) data from (to) the 
cells which are the image of points dependent (depending) on 
the point under consideration with delays given by the timing 
function 

2 2 4 RETIMING TRANSFORMATION 

Leiserson's approach is not the transformation of an algorithm 
to a systolic design but rather the application of a retiming 
transformation to an existing synchronous system (not 
necessarily systolic) in order to improve the performance This 
method (71 starts with the design of a synchronous circuit 
whose correctness is obvious or easily verifiable This 


design is modeled as a finite rooted vertex-Mei ghted edge- 
weighted directed multigraph where nodes represent functional 
cells and edges represent interconnections heights represent 
delays of nodes and register delays of interconnection 
Retiming and k- slow down transformations are then applied to 
the original design in order to obtain a systolic design 
without global broadcasts Optimal retiming transformations can 
be selected so that the transformed circuit has the smallest 
clock period by reducing this problem to an efficiently 
solvable mixed-integer linear programming problem Optimal 
retiming transformations can also be selected so that the total 
number of registers is minimized by solving a linear- 
programming dual of a minimum-cost flow problem Extensions of 
the optimization procedures used can also take into account 
fan-out interconnection bus width, multiple hosts host timing 
constraints and geometric constraints like the number of 
registers per interconnection 

Kung's retiming transformation is applied to the time-invariant 
Signal Flow Graph (SFG) representation of algorithms 12,183 In 
the SFG in general a node represents an arithmetic or logic 
operation performed with zero delay and an edge represents a 
function or a delay Let D and D represent delay elements and 
lower case letters represent constants An SFG will have on its 
edges either a delay element or a constant indicating a 
mul tipl ication by that constant 



There are two classes of SFGs those with local 
interconnections and those with global interconnections Since 
the SFG with global interconnections such as the FFT will 
require spatially global interconnections they are not 
suitable for systolic designs Here only the SFGs with local 
interconnections are considered such as the FIR 

A general systematic procedure for transforming SFG s into 
systolic designs is based on two simple rules 

Rulei. Time-Scaling All delays may be scaled that is D — f nD 
Similarly the corresponding input and output rates have to be 
scaled with respect to the new time unit D by a factor n 

Let a cut-set of a SFG be defined as a minimal set of edges 
which partitions the SFG into two pats The edges of the cut- 
set are grouped as inbound and outbound edges depending on the 
assigned directions to the edges 

Rule2 Delay- Transfer Advancing k time units on all the 
outbound edges and delaying k time units on all the inbound 
edges and vice versa, will not affect the general system 
operation The effect of lags and advances cancel each other 
in the overall timing 

Based on these two rules it was shown that any zero-delay 
edge that is non-temporal 1 y local can be transformed into a 


good oul- 


nonzero-del aw edge This is done by selecting a 
set that includes the target edge <the zero-delay edge) but 
does not include zero-delay edges in the opposite direction to 
the target edge The cut-set could include zero-delay edges 
going in the same direction as the target edge According to 
Rule 2 the nonzero-delay edges in the opposite direction can 
give one or more spare delays to the target edge(s) If there 
are no spare delays Rule 1 is applied to increase the number 
of delays in all the edges It was shown that there exist 
'good cut-sets for SFG with local interconnections [2 183 



CHAPTERS 


A SIMPLE METHOD FOR DESIGNING SYSTOLIC ARRAYS 


Nhen you 

wish to produce a 
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by 

means of an 

instrument 

do not allow 
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to 

complicate it 


-LEONARDO DA VINCI 


This chapter presents the simple method developed for constructing 
sgstolic arrays from algorithms In particular section 3 1 
presents the elgorithm representation In section 3 2, algorithm model 
is presented which consists of developing the Dependence Graph 
(DG) and extracting the valid schedule and projection vectors 
from the DG Section 3 3 presents the extraction of architectural 
details from the information provided by algorithm model step 


3 1 ALGORITHM REPRESENTATION 


The compute bound algorithms under consideration are highly 
structured and can be described in a closed form such as a 
mathematical expression Consider the algorithm of the solution of 
loi*>er triangular linear systems of equations LXa>B where L is an NXN lower 
triangular matrix and X and B are Nxl vectors 


A closed form description of this algorithms is 


X, » (b^ 


/ 1 ,, for 1 


12 3 


N 


<3-1 ) 


To indicate the calcultations more explicitly for Ns=4 the 
eq <3-l) can be expressed as follows 
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To achieve the maximum parallelism in an algorithm the data 
dependencies in the computations must be studied carefully If 
there are no data dependencies between the operations then 
they can be executed at the same time in a parallel computer 
Eq (3-1) indicates that the resultant s are determined from 
certain specified products which are subtracted from b^ s The 
order in which the products are computed and subtracted from b^ 
IS not important This provides considerable freedom as to how 
these calculations are to be performed In general one natural 
order for the calculations is in the order of data 
availability In this algorithm, x^^ is calculated first and 
then used in the calculations of X 21 *3 «nd X 4 and this process 
repeats with all elements of vector X 


A concise and convenient expression for the representation of 
many algorithms is to use recursive equations The derivation 
of recursive equations is often straight forward The recursive 
equation for the algorithm under consideration is 
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for j <■ i 


z 
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where j is the 
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recursion index j*(N-i) 




and 


of the DG from this reoursiue 
sec 1 1 on 


N (3-3) 

2 i and 



equation is the 


3 2 ALGORITHM MODEL 


A dependence graph (DG) is a graph that shows the dependence of 
the computations that occur in an algorithm For the algorithm 
under consideration x^^ is directly dependent upon 
and By viewing each dependence relation as an arc between 

the corresponding variables located in the index space a DG 
as shown in Fig 3-i will be obtained 

The DG has two kinds of cells, a cell performing multiplication and 
subtraction operation (MS cell) and a cell performing dip'ision 
operation (D cell) Such a DG is termed as non-homogeneous DG 

In Fig 3-1, the value x^_j of each element of vector X should 
be 'broadcast to all the index points having the same (i-j) 
index Typically, broadcasts are signaled by missing indices of 
variables in the loop In general this means that global 
communication is involved in array processor design Thus such 
broadcasting of the variables has to be replaced by local 
communication whenever possible 




An •Iflonthm is localized if all the variables are dependent 
upon the variables of the neighboring nodes only Thus a 
locally recursive algorithm is an algorithm whose corresponding 


DG has 

onl y 

local dependencies 

1 e 

the length 

of 

each 

dependence arc 

IS independent 

of 

the 

problem size 

[183 

In 

order to 

avoid 

broadcast and to 

introduce pipelining 

we first 


compute all missing indices and introduce new artificial 
variables such that for esch ffenorated loanable there is only one 
destination 


For the algorithm under consideration the following forward 
recurrence equation has all the variables pipelined and 
executes the algorithm in the order of data availability 


Ri° » Xj^ / 2^^ 


( 3-4 ) 


A 1 

for 1*1 2 3 
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1 ^ S= 1 
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Not# that we have introduced a new variable R^ to propagate 
the data Xq along the broadcast contour i-j » c This is 
because, the variable x^^ has already been used in the i = c 
contour to denote the data involved in the iterative summation 
process 


The localized DG obtained from the eq (3-4) is shown in Fig 3-2 


Note that an algorithm is computable if and only if its DG 
contains no loops (an arc whose end points are the same node) 


3 


2 

1 


Fig 31 


Globally RecursiveDG for Lower Triangular Linear 
Syslem Lx = B ^ 



Fig 32 Locally Recursive DG for Lower triangular Linear 
System Lx = B 









or cycles (» chain of arcs directed in the same way whose end 
points are the same node) 

The main objective of the studies reported in this chapter is 
to derive simple techniques for determining processor array 
implementation for a DG One simple and brute-force method for 
achieving this objective is to use a distinct processor for 
executing the computations at a particular index point of the 
DG This in general leads to a very inefficient use of the 
computational resources since each processor is active only 
for a constant period of time which could be a minute fraction 
of the time required for completing the algorithm To achieve 
better utilization of these resources it is necessary to reuse 
the same processor for handling a large number of 

computations In general the set of computations that are 

allocaied to processors can be arbitrary disjoint subsets of the 
set of all computations However the problem of optimally 

scheduling' the computations that are allocated to a given 
processor becomes extremely difficult if the allocaition is 
arbitrary To render this problem tractable one can restrict 
attention to linear <planar hyperplanar) partitions Here, a 
set of parallel lines (planes hyperplanes) are drawn through 
the index space of the DG so that all computations that 

correspond to index points that lie on the same line are 
handled by the same processor In this manner, the processor 
array itself can be obtained by projecting the DG along these 
lines (planes hyperplanes) onto lower dimensional lattice of 



However not all the 


11 

points known as the processor space 
projections of the DG will meet the regular communication 
local connectivity simple control and simple cell requirements 
of the systolic array architecture The determination of which 
projections will produce systolic arrays is the crux of this 
report 

The DG for the algorithm under consideration is non-homogeneous 
consisting of MS cells and D cells A systolic solution 
produced from the projection of MS cells alone is called a sub- 
systolic array (SSA) as it is responsible for only a subset of 
computations Then the systolic array for LX = B is the 
combination of two sub-systolic arrays where one is produced 
from the projection of MS cells (SSA-1> and the other produced 
from the projection of D cells (SSA-2) The dimension of DG is 
0^2 Denote the homogeneous DG consisting of the MS cells as 
DG-1 and that of D cells as DG-2 Denote the index set of DG- 
1 as 1 and its dependency vectors by d^j^^ d ^2 Denote the 

matrix formed by dependence vectors of DG-i as Dj and that of 
dependence vectors from DG-i to DG-j as D^ j 

Then, for the algorithm under consideration 

2 r . ,ir234344 

I M L (i j) 25^1^4 2^JiJ4 l“j222334 

i2^ » t (i j) ii:i^4 j «= i 3 « 1^1 f ^ 

DG-1 has two dependence vectors dj^^ « (i 1)^ for pipelining 
the R variable and is obtained by joining the cells at 



^8 


location* <2, 2) and <3 3) and di^2 “ -1)^ 

the partial result* and is obtained by joini 
location (3 3) and (3 2) Thus «= [ J 


dependence vectors Further ^^2 



and D 2 j^ » 


for pipelining 
ng the cells at 
DG - 2 has no 

fil 


Thus there are two considerations for mapping a DG onto an 
archi lecture 

(1) In what ordering should the computations be scheduled 

(2) To which cells the computations be allocated 

The first step is the scheduling and the second step is the 
allocation The schedule function represents a mapping from the 
N-dimensional DG onto a one dimensional time space A schedule 
IS based on a set of parallel hyperplanes and is represented 
by s pointing to the normal direction of the hyperplanes All 
the nodes in the hyperplane must be processed at the same 
timestep C181 


For a homogeneous DG, a schedule vector '% is said to be 
valid, if it satisfies the following condition 

* d,^ > 0 <3-5 b> 

where d|^ s are the dependence vectors of the DG 

For a non-homogeneous DG-i a schedule vector s is said to be 
valid if it satisfies the following conditions 

* d^l^ > 0 and s*' f d^j,^ > 0 (3-5b) 

where d^j^ s are dependence vectors of DG-i and d^jl^ s are 
dependence vectors from DG-i to DG-j Morever, this schedule 


<^0 


vector should be acceptable to all other DG-i s The above 
condition is to enforce causality' in the permissible schedule 
vector Thus after finding a set of schedule vectors that 
satisfy the condition (3-5) a schedule vector which results in 
minimum computation time according to the following equation 
IS chosen tl2) 

^ ^ Max s*” * (i^ _ 17) +1 

computation time ts= p-* e (3-6) 

Min s'' * d^j^ 

where 1 and I 2 are any two index points of the DG and d^j^ 
IS any dependence vector of the DG The eq (3-6) represents the 
number of parallel hyperplanes s sweeping the index space I 


For the algorithm under consideration Djj^i 


1 D 
1-1 


and DG-Z has no 


dependence vectors Hence a schedule vector which is valid for 
DG-1 15 also valid for DG-2 It has been found that s=(Z -1)^, 
satisfies the condition (3-5) and results in minimum Computation 
time of 7 clock cycles 


ComputationiimB*= 


(2 - 1 ) * { 

4 

i 

— 

1 

1 

) +1 

(2 - 1 ) * 

0 

-1 



=* 7 clock cycles 


Thus the scheduling of the computations is done Now we need to 
allocate the computations to various processors 


The projection of the MS cells onto the } axis is the same as 
the overlapping of all the column vectors onto one column The 
projection is said to be along the vector (1 0)'' 


This vector 
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IS indicated in the space bw a line between two points with 
coordinates (0^0) and (1,0) The possible projection vectors 
for a 2-dimen5ional DG are p= <1 0)^ (-1 0)^ (0 1)^ (0 -i ) ^ 

<1 1)*' <-l -i)^ <1 -1)^ and (-1 I)*" Vectors with element 
values greater than 1 are not recommended as the resulting 
architecture needs many cells 

In the discussion that follow some conditions have been 
formulated which are to be satisfied by a vector in order for 
it to become the valid projection vector 

(1) The nodes on a equi temporal hyperplanes should not be 
projected onto the same cell This is because the cells on a 
equi temporal hyperplane have to be scheduled at the same 

timestep and if they are mapped onto a cell then that cell 

needs to do many computations at the same timestep which is 

Absurd Hence for a projection vector to be valid 

E*' I p > 0 (3-7) 

For the algorithm under consideration, s ■ (2 -1)^ Hence all 
the possible projection vectors satisfies the above condition 
Note that if p is a valid projection vector, then -p will also 
be a valid projection vector even though it voilates the above 
condition Further if for some projection vector p s‘'lp=0 
then store this vector for future use with another schedule 
vector even though the new schedule vector may result in an 
increase in the execution time 



(2) If the DG IS non-homogeneous then the projection vector 
should not be equal to the data dependence vector in DG-i if 
there exists same kind of dependence vector from DG-i to DG- j 
as it demands a complex cell structure voilating the simple 
cell feature of systolic arrays 


Tor the algorithm under discussion 


1 0 
1 -1 


and 


0 

-1 


Hence p= (0 -1) is not valid Note that the final result of 
SSA-1 needs to be passed to SSA-2 Hence i f we project the DG 
in (0 -1)^ direction the structure of the MS cells with 
partial results being fed back will not allow the final result 
to be passed to SSA-2 


(3) In the case of non-homogeneous DG suppose that there 
exists a dependence vector dj from cell-w of homogeneous DG-i 
to cell-x of DG-j Further suppose that there exists same kind 

of dependence vector d^ , within DG-i from cell-y to cell-z, 

where cell-y is in the neighbourhood of cell-w and in the 

possible projection direction Then vector joining cell-w and 
cell-y IS not valid as the output line of cell-n which is 
obtained by projecting the cell-w onto cell-y needs to be 
connected to cell-x(of SSA-j) and cell-z (of SSA-i ) depending 
upon the algorithm correctness This projection direction is 
invalid, because a DMUX is required at the output line of 

cell-n and hence complex control mechanism It is illustrated 


in Fig 3-3 
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For the algorithm under consideration there is a dependence 

o' 


vector D 


12 


-1 


from the cell with coordinates (2 2) and same 


kind of dependence vector (0 -1)^ exists within DG-i from the 
cell with coordinates (3 3) Hence the projection vector <1 i)^ 
which maps the cells at coordinates (2 2) and (3 3) onto one 
cell IS not valid Similarly the projection vector (i -1)^ 
which maps the cells at coordinates (3 3) and <4 2) is also not 
yAlid 


So far the projection vector <i 0)^ is only not int/Mhd' There are 
some more conditions of this type which are to be satisfied by 
a projection vector to be valid These conditions are described 
here although the projection vector (1 0)*' does not voilate any 
of these conditions 


(4) If the DG IS non-homogeneous and a dependence vector d^ 
is present from DG-j to DG-i and also in DG— i^then that 
dependence vector 'd^ is 'not a valid projection vector This 
IS because a variable 'computed' in SSA-j is being allowed to 
'stay' in SSA-i which needs a complex control mechanism as 
opposed to the 'simple control requirement of systolic arrays 


For the algorithm under consideration D 2 j^ 
of dependence vector <1 1)^ is present 

(1 i>*’ is not a valid projection vector 


1 

“ 1 
Ml thi n 


and same kind 
DG-i Hence 


3 ^ 


(5) In the case of non-homogeneous DG suppose that there 

exists a dependence vector d^ from oell-a of DG-i to oell-b 
of DG-j Further suppose that there exists same kind of 
dependence vector from cell-c to cell-d within DG- j where 

cell-d IS in the neighbourhood of cell-b and is in the possible 
projection direction Then the vector joining cell-b and cell-d 
IS not a valid projection vector as the input line of cell— 
n which IS obtained by projecting the cell-b onto oell-d 
needs to be connected to the cell-a (of SSA-i ) and cell-c (of 
SSA-j) It IS illustrated in Fig 3-4 

For the algorithm under consideration there is a dependence 
vector ^ 21 . * 1 f*'om the cell with coordinates (2 1) and same 

kind of dependence vector (1 1)^ is also present within DG-1 
from the cell with coordinates (2 2) Hence the projection 
vector (0 -1)^ which maps the cells at coordinates (3 2) and 
(3,3) onto one cell is not valid 

Thus for the lower triangular linear systems of equations LX=6 
the projection vector (1 0)^ is the only valid projection 

vector Constructing the systolic array, with this information 
IS the topic of the next section 

3 3 ARCHITECTURAL SPECIFICATION 

The systolic array for LX x= B algorithm can be obtained by 
projecting the corresponding DG in (i 0)^ 


direction This 
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graph-based projection procedure may be formally described in 
terms of algebraic transf ormations For this first we need to 
find the processor basis A corresponding to each valid 
projection vector to identify the processors This processor 
basis A IS denoted by nx(n-l) matrix and is orthogonal to the 
projection vector p Obviously there exists many vectors 
corresponding to the processor basis that are perpendicular to 
the projection vector p But one can choose any set of (n-1) 
vectors perpendicular to p in order to identify the 
processors because the choice of the basis does not change the 
topology of the array but only modifies the labels assigned to 
the processors 

For each remaining valid' projection vector find a set of 
vectors Am perpendicular to the projection vector p with 
element values being restricted to -1, 0 and 1 Then again 
form a new set of vectors that satisfy the 'partitioning ^ 
condition given below 

Am * d^,^ i 0 (3-8) 

where d^j^'s are dependence vectors of DG-i The above condition 
IS to be satisfied by all homogeneous DGs and is to ensure that 
the communication between the links is not cyclic 1121 If 
there are atleast (n-i) vectors which satisfies the above 

_ 

For details about the partitioning schemes that is mapping 
big sized problem onto the resulting small architecture', refer 


the Appendix 
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condition then the resulting architecture is parti t lonable 
and a set of (n-1) vectors is chosen as the processor basis A 
Otherwise the resulting architecture is non-par 1 1 1 1 onabl e 


For the algorithm under consideration <1 0)^ is the only 

valid projection vector There are two vectors namely (0 i)^ 
and <0 -1)^ with element values being restricted to -i. 0 and 1 
and perpendicular to this projection vector 


But both of these vectors does not satisfy the eq (3-8) Hence 
the resulting architecture is not parti tionabl e The 
dimension of the DC under consideration is n=2 Hence 
processor basis A consists of only one Am vector The vector 
CO 1)^ is chosen as the processor basis A 


Now form a geometric transformation matrix T* J^tj The mapping 
T should be bijection i e the mapping T should be 'onto and 
'one-to-one The necessary and sufficient condition for T to 
be bijection is that it should be non-singular i e detT#0 If 
detTsO then choose another set of (n-i) vectors as basis and 
proceed further For the algorithm under consi derati on 


‘ 2-1 
D 1 


and detT=2 With this information the architecture is 


obtained as illustrated further 


FORMING THE ARCHITECTURE 

Obtain the new index set and transformed dependencies by 


3 ? 


applying the geometric transformation matrix T to the index set 
and dependence vectors of the DG respectively Then with this 
information the systolic architecture is constructed as 
foil ows 

The last (n-i) rows of the transformed index set specifies the 

labels of the processors Hence the cells are sketched by 

noting the different labels of the processors Then links 

between the cells are obtained from transformed dependencies 

Note that the first row of the transformed dependencies 

specifies the amount of external delay on that link and the 

remaining rows are difference vectors of the labels of the 

processors and represent the physical link between the 

a 

processors For example a link means there is an external 

delay of (a-1) clock cycles on link b which exists from cell 
with label 1 to a cell with label m where bs m-1 Data timings 
are extracted from the transformed index set in which the first 
row specifies the timestep at which the computations are 
scheduled and the remaining rows specifies the processors to 
which the computations are allocated 


For the algorithm under consideration there are two index sets 
and I 2 " and the dimension n of the DG is 2 The transformed 
index sets and dependence vectors are printed below 



34 a] r24635 4‘ 
334 “ 222334 
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Thus there are three MS cells according to the last row of the 

2 

transformed index set of 1 and one D cell according to the 
last row of the transformed index set of I 2 From the 
transformed dependencies of it can be found that in sub- 

systolic array-1 (SSA-1), there exists a link (1) from the 
cell with label 2 to the cell with label 3 with no external 
delay and same kind of link exists from the cell with label 3 
to the cell with label 4 Further in SSA-i there exists a link 
(-1) from the cell with label 3 to the cell with label 2 with 
no external delay and same kind of link exists from the cell 
with label 4 to the cell with label 3 Transformed dependencies 
of T >^2 indicates that there exists a link <-l) from the cell 
with label 2 of SSA-1 to the cell with label 1 of SSA-2 There 
exists a link (i) from the cell with label 1 of SSA-2 to the 
cell with label 2 of SSA-1 according to the transformed 
dependencies of D 2 i To obtain data timings consider, for 
example the computation indexed by (21) The tranformed 
coordinates are (3 1) meaning that the computation time is 3 

and the label of the cell is 1 Notice that the tranformed 
coordinates are offset by some initial values 

The resulting architecture for solution of lower triangular linear 
systems Lx » b is shown in Fig 3-5 
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The throughput which is defined as the number of output 
components per clock cycle is equal to half i e one output 
sample per two clock cycles for the architecture in Fig 3-5 
Initial offset which is defined as the time that has to be 
elapsed till the computations start is (N-1) = 3 clock cycles 
for this architecture Its computation time is ( 2*N-1 ) = 7 

clock cycles Final offset which is defined as the extra time 
needed to clear all the output components in the pipeline when 
the last computation is over is (N-1) =s 3 clock cycles for this 
architecture Thus the total execution time required which is 
obtained by adding the initial and final offsets to the 
computation time is 13 clock cycles Further note that this 
architecture is non-parti tionable thereby requiring N cells 


to construct 









CHAPTER4 

DESIGN TOOLS AND RESULTS 


I ever loved to see 
everything in circles and 
squares 

-CERVANTES 

A software package ADSA is developed for Automatic Design of 
Systolic Arrays Section 4 1 describes the design facility The 
systolic solutions for upper triangular linear systems of 
equations matrix-matrix multiplication discrete fourier 
transform LU decomposition FIR filter recursive convolution 
and IIR filter are given in section 4 2 

4 1 DESIGN FACILITY 

The block diagram of the design facility is shown in figure A 
The input to the ADSA program is the dimension n indices and 
dependence information of the DG For a homogeneous DG the 
indices and dependence information of the DG consists of the 
index set and dependence vectors of the DG In case of 
non-homogeneous DG this information comprises of the index 
sets of various DGs and the dependence vectors within each DG-i 
and from each DG-i to DG-j 



Figure A The ADSA system block diagram 











Schedule tweeter generator This part of the program finds a set of 
schedule vectors that satisfy the eq (3-5) and then chooses a 
schedule vector which results in minimum execution time 
according to the eq (3-6) If no schedule vector exists then 
it asks the user to modify either the dependence information of 
the DG or to use a different algorithm of the same problem 

Projection i^ector generator This part of the program first finds 
all the projection veectors that satisfy the condition 
s^ * p > 0 If for some projection vector p s^ I p * 0 then 
this vector is stored for to be used with another schedule 
vector 

If the DG IS non-homogeneous , then it removes all the 

projection vectors that voilate the conditions given below 

(1) Projection vector should not be equal to the dependence 

vector from DG-i to DG-j if there exists same kind of 

dependence vector within DG-i 

(2) Projection vector should not be equal to the dependence 

vector from DG-i to DG-j if there exists same kind of 

dependence vector within DG-j 

(3) If there is a dependence vector from cell-a (an index 
of the DG) of DG-i to cell-b of DG-j and same kind of 
dependence vector exists within DG-i from cell-c to cell-d 
then the projection vector which maps the cell-a onto cell-c is 
not valid 

(4) If there is a dependence vector from cell-a of D6-i to 
cell-b of DG-J and same kind of dependence vector exists within 
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DG-i from cell-c to cell-d then the projection vector which 
maps the cell-b onto cell-d is not valid 

If no projection vector exists then it asks the user to either 
modify the algorithm or use a different algorithm of same 
problem 

Transformation matrix generator This part of the program finds the 
processor basis A which is a set of <n-l) vectors 
perpendicular to the projection vector and hence geometric 
transformation matrix T for each valid projection vector 

Finally the program prints the transformed index set and 
dependence vectors from which the architecture is constructed 

This software is developed only for the purpose of research and 
IS needed to be generalized as the maximum dimension n of the 
DG being allowed is 3 (assuming that a higher dimensional DG 
can be split into 3D DGs and use mul tiprojection ^ to obtain 
a planar systolic array) and the number of non-homogeneous DGs 
being allowed is 3 

The input to the ADSA program is the data dependence 
vectors and index set of the DG and this information is 
obtained manually This step may be automated but more research 
IS needed for this In general the solution to a problem may 

^See the Appendix for discussion on 'multiprojection' 



be provided by a number of different algorithms Each algorithm 
may lead to a different DG and hence different architectures 
For example the discrete fourier transform problem can be 
formulated as matrix-vector mul t i pi i ca t i on algorithm Further 
an entirely different DG can be obtained if we apply Horner s 
nested rule to the same problem 

At present the architecture is constructed manually using the 
transformed index set and dependencies This step can also be 
automated Further the step of finding an optimal systolic 
solution can also be automated by computing an optimization 
index such as time area etc The total required execution time 
IS available and the area results from the number of processors 
and the complexity of each processor The concepts of 
multiprojection partitioning etc can be readily implemented 

The system is used to design systolic arrays for several known 
problems The system has reproduced known systolic solutions 
for these algorithms and in some cases has produced several new 
solutions 

4 2 RESULTS 

This method for constructing systolic arrays is applied to 
several problems and results are reported Each algorithm is 
translated into locally recursive equations from which the DG 


IS obtained For each algorithm the method is applied to small 


problem size The selected problem size is the smallest size 
that represents the algorithm That is the DG generated from 
two different problem sizes should differ only in size and not 
in the number of subsets of computations For each algorithm 
the valid schedule and projection vectors are reported 
Geometric transformation matrices data flow timing and data 


movement are also reported 
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4 2 1 UPPER TRIAmULAR LINEAR SYSTEMS OF EQUATIONS UX»=B 

The algorithm for the solution of the upper triangular linear 
systems of equations UX=B where U is an NxN upper triangular 
matrix and X and B are Nxi vectors is described as 

Xi = <bj - 5!) *-113 * ^ '-•ii fo*" icsN 2 1 (4-1) 

j > 1 

For this algorithm the calculations for N=4 are explicitly 
expressed as follows 
X4 *= (b^) / 



( b3 - 

^34 

X 

5^4 

) / ^^33 

^2 * 

(b 2 - 

U24 

I 

X4 

- ^23 * '' ^22 

X 

II 

(bi - 

^14 

X 

5^4 

- Ui3 * X 3 - Ui 2 * 5^2^ ^ 


In this algorithm the elements of vector X are calculated by 
back-substitution That is x^ is calculated first and then 

used in the calculation of X 3 then both of them are used in 
the calculation of X 2 and so on The natural order of 
performing these calculations is in the order of data 
availability of the elements of vector X The following forward 
recurrence equations assume this natural order 

= Rn‘-‘ (A-3> 

*^1^ = ‘^N+l-i N+l-(i-j) 
j=<N-l) 2 i and i>o 


)+l 


0 


x/ / u^° 


^N+l-i 


'1-1 


J-i 


for i«l,2 N 
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Note th»t a new variable has been introduced to propagate 

the data Xc along the broadcast contour i-j = m This is 
because the variable x^^ has already been used in the i = m 
contour to denote the data involved in the iterative summation 
process The localized DG obtained from eci (4-3) is shown in 
the figure 4-1 


Since the DG consists of two types of cells the systolic 
solution will be a non-homogeneous array Let the DG consisting 
of MS cells be denoted by DG-i and the DG consisting of D 
cells be denoted by DG-2 It has been found that there exists 
only one valid projection vector namely (1 0)^ and the 
schedule vector (2 -1)^ results in minimum computation time of 
seven clock cycles Thus the processcr basis A corresponding to 

Hence the geometric 


this projection vector is (0 1) 

transformation matrix T= 


[ 2-1 
~ 0 1 


The transformed index sets 


and dependencies are printed below 


T 1 


T i Ij2 


T * D^ 


‘Z-ll [2343441 [246354' 

oij * [222334J “[222334 


2-l]„[l234l [1357' 

OlJ*[llllJ“[llli 



T f Di2 



T f D 


21 


‘ 2-1 
^ 0 1 
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The architecture is shown in figure 4-2 which is constructed 
from the transformed index sets and dependencies Note that the 
last row of the transformed index set specify the labels of the 
processor Thus there exists three MS cells and one D cell The 

fol 

dependence vector of variable x is transformed into a link 

(-1) with no external delay Thus there exists a link (-1) 
within sub-systolic array-i <SSA-1) consisting of MS cells 
from the cell with label 3 to the cell with label 2 and from 
the cell with label 4 to the cell with label 3 Similarly 
various links are obtained by noting that the transformed 
dependencies T*Dj indicates the links that exists within SSA-i 
where as j indicates the link that exists from SSA-i to 

SSA-j After sketching the architecture the data timings are 
extracted from the transformed index set in which the first row 
specifies the time at which a computation is scheduled and the 
remaining rows specify the processor to which that computation 
IS allocated 

This architecture has an initial offset of (N-i)«3 clock 
cycles Its computation time is (2*N-i)»7 clock cycles Since 
there is only one output component per two clock cycles the 
throughput is half Note that this architecture is non- 
par t 1 t 1 onabl e , hence the number of cells required depends upon 
the size N of the problem 
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422 MATRIX-MATRIX MULTIPLICATION 


Consider matrix-matrix multiplication algorithm C= A»B where 

A(a^j) IS NXM matrix is MxQ matrix and C(Cjj) is NXQ 

matrix This algorithm is described by 

'= 13 = ^ *ik * ^k 3 2 N and 3=1 2 Q (4-4) 

k = i 


The locally recursive version of this algorithm is 


1 3 


0 


'1 3 


'1 3 


'^l 3 ■*■*13 * ^1 3 


*1(3-1) 


M 


"-1 3 
*lk 


1 3 


‘ll 


( 4-5 ) 


'1 3 


= b 


(1-1)3 


•=k3 = ^13 


The DG obtained from eq (4-5) is shown in figure 4-3 Since 

the DG contains only one type of cell the systolic solutions 

are homogeneous arrays Further each cell does the multiply and 

accumulate (MAC) operation and hence is not indicated For N=3 

M=3 and Q=3, the index set and dependence vectors of the DG are 
given below 


111111111222222222333333333 

111222333111222333111222333 

123123123123123123123123123 


D 


0 10 
10 0 
0 0 1 


The first dependence vector (0 1 0)^ is due to the pipelining 
of the variable a, the second vector is due to the pipelining 
of the variable b and the last vector corresponds to the 
accumulation of partial results that is the variable c 

CCNT""’'L 


1 ! 

da No 


\ P jq 

^ 4 @i ^ 0i.seJM»O*C^ 
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The procedures were epplied to thi* DG end found thet there 
exist thirteen valid projection vectors The processor basis A 
correspond! no to each valid projection vector is also obtained 
Geometric transformation matrices corresponding to the valid 
projection vectors are listed in table 4-1 Note the various 
schedule vectors used Architectural details are also listed in 
this table Final offset is defined as the extra time needed 
to clear all output components in the pipeline when the last 
computation is over The total execution time required is found 
by adding initial and final offsets to the computation time 
Further note that all systolic solutions are part i t i onabl e 
except the solutions 4 5 6 and 7 The systolic solutions 
corresponding to the thirteen valid projection vectors are 
shown in figure 4-4 through 4-16 The architecture of 
figure 4-4 corresponding to the projection vector (1 0 0)^ is 
obtained as follows 


The geometric transformation matrix corresponding to the 

r 


projection vector <1 0 0) is 


111 
Oil 
0 10 


The schedule vector 


(1 1 1)^ results in minimum execution time of <N+M+Q-2)=7 clock 


cycles The transformed index set and dependencies are given 
below 


TII^- 


345456567456567678567678789 

234345456234345456234345456 

111222333111222333111222333 


TiD 


‘ill 
10 1 
10 0 



The last two rows of the transformed index set specify the 
labels of the processors Thus the architecture needs nine 
processors The dependence vector <0 1 0)^ of variable a is 


transformed into a link 


with no external delay Thus there 


IS a link from the cell with label 


from the cell with label 


to cell with label 


to cell with label 


and so on 


to pipeline the variable a Further the dependence vector 
(i 0 0)^ of variable b is transformed into a self loop with 
no external delay Hence the elements of variable b will stay 
in the cells The dependence vector (D 0 1)*' of variable c is 


transformed into a link 


with no external delay The 
timestep at which a computation is scheduled and to which 
processor it is allocated are noted from the transformed 
index set Thus the data timings are extracted which completes 
the architecture 


The transformed index set and dependencies of various valid 
projection vectors are not reported here due to prolixity The 
veracity of the architectures in figure 4-5 through 4-16 can be 
verified by forming the transformed index set and dependencies 
and following the procedure illustrated for the projection 
vector (1 0 0)^ 

The best systolic solution for matrix-matrix multiplication 
depends on the specific application From the results indicated 
in table 4-1, for minimum chip area, the best solutions are 
1-3 For minimum execution time the best solutions ere 8-10 



hi 


2 

Optimal solutions based on AT or AT optimization can also be 
selected Note that the total required execution time for 
solutions i-3 depends on the implementation of the loading / 
unloading process and chip area for solutions 11-13 inculdes 
the implementation of delay elements between the cells 



labltt 4-1 comparison of archi itcturcs for 


Matrix-Matrix multiplication 


Projaction Transformation Number Initial Computation Final 
vector matrix of cells offset time offset 


1 (10 0 )*' 

2 (0 1 0 )'' 

3 (0 0 1)'' 

4 (1 1 O)*" 

5 (10 1)'' 

6 (0 1 1 )'' 

7 (1 1 1)'' 

8 ( 11 - 1 )'' 

9 (1 -1 1)'' 


111 
Oil 
0 10 

'l 1 l' 
10 1 
0 0 1 

’l 1 l' 
110 
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'l 1 l‘ 
1-11 
-111 

1 1 l‘ 
11-1 
1 0-1 

’l 1 l' 
11-1 
1-11 

'l 1 l' 
0 1-1 
-11 0 

'l 1 l' 
oil 
10 1 

'l 1 l' 
110 
oil 
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9 - 7 - 

15 2 7 2 

15 2 7 2 

15 2 7 2 
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19 - 7 


10 (-1 1 1 )'' 


111 
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10 1 


19 
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11 (1 -1 0 )'' 


2 11 
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15 2 
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Fig 4 8 Systolic Solution for Matrix -Matrix Multiplication C=A.B p=(101) 









Cm 


Cout = Cin ♦ a * b 

Fig 4 9 Systolic Solution for Matrix -Matrix Multiplicalion C=A*B p=(0tD 









I 


0 0 C„ 0 0 0 0 


0 C|,0 0 Ck 0 0 


C 0 0 CbO 0 C 


OOb^OOOO' 


0 c„ 0 0 c . 0 0 < 

Ob>p 0 b„0 0 ' 

0 0 C„0 0 ' 0 0 I < 

b,,0 0b 00 b/ 
0 b, 0 0b, 00' 


00 bnOOOO' 
t= 7 6 5 4321 




Msms 



a 

Cput ^ Cr»,4 


Fig 4 10 Systolic Solution for Matrix - Matrix Multyplicalion C=:A»B p=mi) 














OO O 























65 



Fig 414 Systolic Solution for Matrix - Matrix Multiplication 
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423 DISCRETE FOURIER TRANSFORM (DFT) 

Let Cx<n) n=0 1 N-1 3 be a finite length sequence The DFT 

of x(n) IS defined as 

Uk (4-6) 

n=0 

^ -o2TT / N 

where w = e and k=0 1 N-1 

Eq (4-6) can be represented in a matrix-vector multiplication 
format which is shown below for N=:4 


yo 


1 

1 

1 

1 


’‘O 



1 

w 

w^ 

W^ 






2 

4 

6 

* 




1 

w 

w 

w 


^2 




3 

6 

9 



W3 


1 

w 

w 

w 


^3 


(4-7 ) 


The locally recursive version of eq(4-6) is given below 


W- Ir — 4 t 

^1 “ ^1 *^1 * ^ 

y ® K 0 , 


w,*' - 




N 


^1 


‘k-1 


(4-8) 


The DG obtained from eq (4-8) is shown in figure 4-17 Since 
the DG contains only one type of cells the systolic solutions 
are homogeneous arrays Each cell does the multiply and 
accumulate operations (MAC) and hence is not indicated The 
index set and dependence vectors of the DG are given below 


fl 2 3 4 

[l 1 1 1 


1 2 3 4 1 2 3 

2 2 2 2 3 3 3 


4 1 2 3 41 
3 4 4 4 4 J 


1 0 
0 1 



The first dependence vector is of variable x and the last is of 
the variable w There exists four valid projection vectors 
namely 0) (0 1) (i 1) and (1 — i)^ and the resultins 

architectures are shown in figure 4-lB through 4-21 The 
schedule vector (1 1)*' is used with the first three valid 
projection vectors and results in a computation time of 
(2*N-1)«7 clock cycles The schedule vector (2 1)^ is used with 
the last projection vector and results in a computation time 
of ten clock cycles The results are listed in table 4-2 


Table 4-2 Comparison of architectures for DFT 
(Matrix-vector multiplication algorithm) 


Pro jecti on 
vector 

Transformation 

matrix 

Cells Initial computation 
offset time 

Throughput 

(1 0)'' 


'l 1 ■ 
0 1 



4 - 7 

1 

(0 1)^ 


\i l' 

I 

i 


4 - 7 

- 

(1 1)^ 


A A ; 



7 3 7 

(1/2) 

(1 -1)*' 


‘2 i‘ 

i 1 

• «• 



7 3 10 

1 

Solutions 

i,: 

2 and 

4 

have no final offset 

and are 

'parti tionabl e 

' Solution 

3 has a final offset of 

three clock 

cycles and 

is 

not 

parti tionable 



For minumum chip area, the best solutions are 1 and 2 Note 
that the total required execution time includes the loading of 



7 0 


input sequence x for solution 1 and unloading of the output 
sequence W for the solution 2 Further all systolic solutions 
except 3 are part i t i enable providing the facility of mapping 
large DFTs onto these architectures 

But efficient architectures can be obtained by employing the 
Horner's nested rule to the DFT algorithm Thus the DFT 
algorithm can be written as 

Wi * ^0”^ wMx2+ + (4-9) 

The above equation can be written as the following recursive 
equation which is localized 

If k* - ( if If 

yi * y^ * 

y^*^ » 0 , « y^*^ (4-10) 

^i*^ - * ^'N-k 

W|*^ m , W|^ « W^“'^ 

The DQ obtained from #q <4-10) is shown in figure 4-22 The DG 
IS homogeneous consisting of MAC cells only The dependence 
vectors and index set of the DG are given below 



2 [123412341234123 4] 

• 111122223333444 4j 

The first dependence vector is of variable x the next is of 
variable w and the last is of variable y There again exists 
four valid projection vectors The resulting architectures were 



n 


shown in figure 4-23 through 4-26 The transformation matrices 
and architectural details of these solutions are listed in 
table 4-3 


Table 4-3 Comparison of architectures for DFT 
(Horner s nested rule version) 


Pro jecti on 
vectors 

Transformation 

matrix 

Cel 1 s 

Initial 

offset 

Computat 1 on 
t ime 

Throughput 

(1 0)*' 


1 l' 

0 1 
• « 


4 

- 

7 

1 

(0 1)*' 


'l l' 
1 0 


4 


7 

- 

(11)*' 


i 1 -1 

L J 

! 

7 

3 

7 

(1/2 ) 

(1 -1)*' 


' 2 1 ' 

1 1 


7 

3 

10 

1 

Solutions 

1 , 2 and 4 

have 

no final offset 

and are 

'parti tionable 

' Solution 

3 has 

a final 

offset of 

three clock 


cycles and is not parti iionable 

Note that the archi lectures resulting from the DO in figure 
4-22 have an advantage in that only boundary cells does the 
input/output operations where as in the architectures resulting 
from the DQ of figure 4-17, all cells does the I/O operations 
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Fig 417 DG for DPT ( Matrix -Vector Multiplication Version) 
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Fig 4ie Systolic Solution for DFT ,p=(1 0)' 
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4 2 4 LU DECOMPOSITION 



An NXN LU decomposition algorithm, C •«: AIB where C ® is 

«n NxN non-sinsuler metrix, A * is an NxN lower 

Inenoular matrix with a unit diagonal and B « (Li,) is an 
upper triangular matrix, can be described as follows 


•ii * 


^^11 “ JS *ik**^k 1 } 

^ S * 


for i«2 3 N and j**! 2 i-i 


^13 “ “ !IC»ik*‘^kj 2 N and )=i N (4-11) 

k < 1 


For N»4 , the matrices are as follows 
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C 44 
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eq 

(4-11) can be formulated as 
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N 

and 

k < 
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The algorithm of the eq (4-12> needs to 
are missing indices of variables a and b 
version of the tq ( 4 - 12 ) is given below 


^i i 

. ^ 
‘i » i 


k ^ k -1 ^ i b ^ 


m C 


M O 


i . j 

I 

1 , i 


k -1 


/ b 


4 j 

» 

1 • i 


i < 3 

if i m k 


* *i ,i-l 


be localized as there 
The locally recursive 


if 3 k 


(4-13) 
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. k ^ k~l 

bi 3 " ^3 3 

K k 

- ^=1 ) ' « 
fork«!i,2 N , i 

The localized DG y% shown 
dependence lines in the k-d 
<1 3 *k) to (i,j,k+l) are not 


1 f 1 SB k 

If 1 # k 

1 1 *1 N ' 

“ k , ,N and 

in figure 4-27 
irection that run 
drawn 


^1 J ® 

J » k, N 
Note that the 
from the point 


The DG is non-homogeneous consisting of 'multiply and subtract 
operation (MS) cells, 'division operation' (D) cells and 
'equating operation' (E) cells Let us denote the DG consisting 
of MS cells as DG-j, , DG consisting of D cells as DG-2 and the 
DG consisting of E cells as DG--3 The dependence vectors of the 
various DGs are given below 


‘o 1 o' 


f 

1 

10 0 

, Do w 

0 

0 0 1 

Jm 

0 


The first dtpendenoi vector of Dj^ is due to the pipelining of 
variable a, the second vector is due to the pipelining of the 
variable b and the last vector is of variable o DG-2 has only 
one dependence vector to pipeline the variable b DG-3 has no 
dependence vector Further the following dependence vectors 
exists between the 00s 
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32 




and no dependence vector exists from DG-2 to DG-3 



79 


The index net of veriou# DG* ere given below 
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It has been found that there exists only one valid projection 
vector, namely (1 1 1 ) '' and the schedule vector <1 1 1)^ 


results in 
processor 

r 1 0 -ii 

1-10 


a minimum computation time of ten clock cycles 


basis A correspondina to this projection vector 

1 1 l' 

Hence the transformation matrix T is ] 1 D-1 

1 -iO 
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IS 

The 


transformed 


index set and dependencies 


are printed below 
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The architecture built Nith this information is shown in 
figure 4-28 Note that the transformed dependencies T I 
indicates the links that exists within SSA-i where as T * D 
indicates the links that exists from SSA-i to SBA-i Further 
note that a latch has been used in the place of E cell The 
architecture of figure 4-26 has an initial offset of three 
clock cycles and final offset of three clock cycles Hence the 
total required execution time is (3+3+10)«Bi6 clock cycles The 
throughput per pipeline is (1/3) as only one output component 
appears for every three clock cycles in each pipeline 
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Pig 4 26 SysttoDc Solution for LU DecomposHion C = A B p=(111)' 
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4 2 5 FINITE IMPULSE RESPONSE (FIR) FILTER 

A common representation of a digital filter of order N is given 
by 


Un *• 


S f'i * i 


(4-14) 


1*1 1 wO 

where Xn and Wn are input and output signal^ respectively 

There are two classes of basic digital filters If the 
coefficients in eq (4-14) are all zero, then the filter is 
known as moving average' (HA) filter else it is known as 
'autoregressi v@ moving average' (ARMA) fitler Note that MA 
filters are FIR filters and ARMA filters are IIR filters 


Thus for FIR filter, the input-output can be described as 
& "i * »n-i 


Un * ^ wi 

i»«0 


The eq (4-15) oan be represented by the following backward 

reourrenoe equation 
i 


Ui' • 

i j 

W| w 


" ’^i-l 
» 0 


j —i 


*^i«i *“ 

i 


i 


(4-16) 


M-1 *• ^ 


M+i 


^ Wi-1 " Wl 

The DG obtained from tq (4-16) is shown in figure 4-29 for 
M « 4 All the oil Is does the MAC operation and hence the 

systolio solutions are homogeneous arrays Note the index set 
and dependenot veotor* of the DG It has been found that there 
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exists four valid projection vectors and the architectures are 
shown in figure 4-30 through 4-33 The results are furnished in 
table 4-4 These architectures can be constructed by computing 
the transformed index sets and dependencies and following the 
procedures outlined in the previous sub-sections Note that the 
pipeline of variable x in figure 4-34 exhibits non-near 
neighbour communication This possibility can be ruled out by 
putting th# condition TID. #mltT*D, where D. and D are 

* e 1 j 

dependence vectors of a homogeneous DG and m is a constant 
greater than one This is not incorporated in the ADSA program 
because the meaning of the term 'local oonneclivity ' is quite 
ambiguous 


Table 4-4 Comparison of architectures for FIR filter 


Projection Transf ormati on 
vector matrix 

Cells 

Initial 

offset 

Computation 

time 

Throughput 

(1 0)'' 

'l i’ 
0 A 


5 

- 

9 

1 

(0 i>^ 

i i 

1 0 

* * 


5 

4 

9 

- 

(1 11*' 

Jr J) 

jl Jl 


5 

4 

9 

(1/2) 

(1 -i)*' 

‘2 a‘ 

i i 

R * 


9 

4 

13 

1 


All architectures except that of figure 4-32 are 

’partUionable' 
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Purlher a di f f ®r «nt 
reported htr e can 

recurrence equations 
3 J+A 3 M 

Wjj * * 

3 3 A 




tifi 




0 


»et of architectures which are not 
be obtained by the following forward 


^1-1 “ (4-17) 

« Wi° 


In many application* it i» desirable to design filters that 
have linear phase In this way signals in the passband of the 
filter are reproduced exactly at the filter output except for a 
lime delay corresponding to the slope of the phase For a 
linear phase FIR filter 

Hence the eq (4-15) gets modified as follows 

Un - f w, > »n., + 

i«0 1*80 

where 8 » H/2 

Let the two summation terms in the above equation be 
represented by P and R respeotively Then the #q (4-18) can be 
reprisented by the following baokward reourrenoe equations 

« ^ 1 - 2 ^"'*^ ♦ ^ 
y|^ « 4 . * ^i^ * ^ 


M 

,M-1 

(4-19) 




Jv 

P,° - D R," . 0 , 


The DG constructed from ®q (4-19) as shown An fAgure 4-34 for 
H « 4 There exA^^ts only one valAd projectAon vector namely 
(1 0)^ end Ihr srheriule vector (i l)^ is used Thus the 


Iransf ormAt A or\ malrAX T as 


1 1 
0 1 


The resultAng archAteoture as 


shown in fAgur® 4-3^ ThA» architecture has no AnitAal and 
fAnal offsets Its computation time and hence total required 
execution lime as seven clock cycles and throughput as one 
Note that this architecture requires only half the number of 
cells as compared with the architecture of figure 4-30, but 
there a« an increase in the complexity of the cell It is 
possible to design an architecture with less complex cell 
structure by remodelling the DG in such a way that the 
'multiplication of h and x variables ooours after the 
'addition of proper x sompontnts 1185 


Further not® that th® architecture for even number of filter 
ooiffioients Hill be different For example, for M •» 5, the 
trohiticiure will have three MAC-1 oells followed by an 'adder 
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Fig 4 30 Systolic Solution tor FIR FHter p=(l 0)^ 
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4.2.6 RECURSIVE CONVOLUTION 


The recursive convolution equation can be expressed as 

yn ~ *^1 * '^n-i (4-20) 

1-1 

Eq.( 4~20) can be represented by the following forward 
recurrence equation . 


Ri 


0 


. rji , R,>' 




ri = ri_i 

N+1 


Ri^ = Ri-1 

j 


j-l 


= Q 


= rj_jL (4-2i) 

u - R 

^i - ^i+N+1 


Bs yQ ; 

for j ss N , N-1 , . . 2 , i 

The DG corresponding to the eq.(4-21) is shown in figure 4-36 
for N ss 4 . The systolic solution will be a non-homogeneous 
array because the DG contains MAC cells and E cells. There 
exists only one valid projection vector, namely (i 0)^ and a 
schedule vector <2 -1)^ is used. Thus the transformation matrix 
The resulting architecture which is obtained from the 
transformed index sets and dependencies is shown in figure 
4-37. Note that a 'latch' can be used in the place of E cell. 
The architecture has an initial offset of N (four) clock cycles 
and final offset of N (four) clock cycles. The computation time 
is 2*m-l (eleven) clock cycles where m is the length of the 
output sequence. And its throughput is half. 


i s 


2-1 
0 1 


An architecture with a throughput of one can be obtained by 
applying the dMcle-and-oonQuer technique. This modified array has 
about the same area as the array of figure 4-37. 




From eq.(4~20), obtain 
N 

^0-1 “ '"i * '='n-i-i (4-22) 

i -I 

Further eq.(4-20) can be written as 


Un - X * Up-i (4-23) 

lm2 

Hence substituting eq.(4-22) into eq.(4-23), we have 


Un “ r^ * ^ Ti * 

iB=i 1=2 


N+1 

“.Z * yp_i 

i«s2 



(4-24 ) 

where aj^ « r^ + r^ * '^i-1 

for 

2<:i^ N 

and a,^^i * 

This modification does 

not 

change 

the total number of 

coefficients . 





Dividing the eq.(4-24) into sub-equations, we have 


1 n 



Ittsnl 

Qn » 

nriisl 


eq. (4-25) 


*2m * yn-2m 

*2m+l * ^n-(2m+l) 

, it is assumed that 


N is even 


(4-25) 

without loss of 


generality. The two sub-equations of eq.(4-25) can be treated 
as two separate problems and hence two separate DGs can be 
obtained. Further these two DGs are linked by the relation 
yp = Pp + Qp. Thus the combined DG which is obtained from the 
following recursion equations is shown in figure 4-38 for 


N « 4 


. ..i « R.j 
1 * ^1 


*S P^'”* + a/ i R 


®i * ®i-l 


= a^ 



*N+2-2 j 



3 r% 3 + i- 

Ri *= f^i-2 

f 

p N/2+1 

Ri 

u N/2+1 

'^i 

and 

Pi° = 0 

for j = 1 2 

N/2 





« 3 3 + i 

Qj * ^1-2 

* s/ 




(4-26) 

3 « 3 

*1 ®i-l 


^23-(N+1 ) 

= a^^ 



Si^ b: 

> 

3 N/2+1 

_ p N/2+1 
= 1 

and 

Qq^ *= 0 

for 3 * N/2+2 , 

*N+1 





N/2+1 „ p N/2 

^ n _N/2+2 





“ Pj + Qi_2 


c N/2+2 c N/2+i 

Si = Si_^ 

Note that we have introduced two variables R and S to propagate 
the components of y 


let us denote the DG consisting of MAC-1 cells alone by DG-1 
the DG consisting of 'adder' (A) cells alone by DG-2 and the DG 
consisting of MAC-2 cells alone by DG-3 Note the index sets 
and the dependence vectors within a DG and from one DG to 
another DG It has been found that there exists only one valid 
projection vector* namely <1 0)^ and a schedule vector (1 1) 
is used Hence the geometric transformation matrix T is 


1 1 
□ i 


The resulting architecture is shown in figure 4-39 It has an 


initial offset of N/2 (two ) clock cycles and a final offset 
of N/2 (two) clock cycles The computation time is 1 clock 


cycles where 1 is the length of the input seouence 



9;i 



Fig 4 36 DG for recursive convolution 
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Fig 4 37 Systolic solution for recursive convolution , p-(1 0) 






Fig 4 38 DG for recursive convolution (Divide and conquer version) 
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Fig 4 39 Systolic solution for recursive convolution , P-(1 0) 
(Divide and conquer version) 







4 2 7 INFINITE IMPULSE RESPONSE (HR) FILTER 


For an HR filter the input-output can be described as 


Un » S + :& ^ 

i-0 i«l " ^ 


(4-27) 


A set of archi tectures which are not reported here can be 
obtained from a DC which is obtained by combining the DCs of a 
FIR filter and a recursive convolution equation corresponding 
to the first and second summation terms in the eq (4-27) 
respect i vel y 


However a new architecture which is obtained by remodelling the 
eq (4-27) is presented here 
The eq (4-27) can be written as 

yn “ Z 

1 «S!0 

Where (4-28) 

JbbO 

and S rm * 

m«“l 

The above recurrence equations are obtained by first expressing 
the past output samples in terms of the input Xn only (starting 
With then substituting them in the expression for 

the present output sample so as to make it to depend upon the 
input Xn only 

In the eq (4-28), the first two recurrence equations 
corresponds to the 'linear convolution' (or FIR filter) 
algorithm and the last recurrence equation corresponds to the 



recursive convolution aloonthm The resulting architecture 
IS shown in figure 4-40 for M « 2 N » 3 and an input sequence 
of five samples This architecture is obtained by concatenating 
the architectures of recursive convolution with the 
architectures for FIR filter Note that many architectures for 
IIR filter can be obtained by choosing the different 
architectures for recursive convolution and FIR filter 
algorithms In the architecture of figure 4-40 we have used 
the architecture of figure 4-39 for recursive convolution (note 
that the coefficients r^ gets modified to a^ as discussed in 
sub-section 4 2 6) and the architectures of figure 4-30 and 
figure 4-31 for FIR filter algorithm 
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CHAPTER 5 


CONCLUSIONS AND SCOPE FOR FURTHER WORK 

Systolic array structures are known for their regular 
structures and simple cells These arrays operate synchronously 
performing dependent computations through the pipelines and 
independent computations in parallel The advantage of using 
extensive pipelining a.nd multiprocessing makes them suitable for the 
implementation of many computationally intensive problems 
Although they have many advantages in their architectural 
properties, their design involves stringent data scheduling 
the right datA has to reach the right cells at the right time in order to 
preserve algorithm s correctness This involves the 
determination of the spatial and timing distributions of data 

Several attempts have been made to systematize the design of 
systolic arrays either directly from an algorithm or from an 
intuitive non-systolic design Presented here is an 
introduction of a simple and systematic systolic array design method aimed 
at solving some of the existing design problems The design method in 
general, follows three major steps the algorithm representation 
algorithm model and architecture specification The algorithm representation 
involves transforming the algorithm in the form of 
mathematical expressions, into a set of locally recursive 
equations -At present, t-Mis process is ad hoc but research is 
going on to systematize this step In algorithm model step a 

98 
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Dependence Graph <DG) is Buslematicallu obtained from these 
recursive equations From this model the computations are 
first scheduled and then grouped among a set of cells such that 
the systolic array properties such as simple and regular data 
and control communication, which are represented in terms of a 
set of conditions based on the dependence information and index 
set of the DG , are preserved The grouping of computations is 
done via geometric projections Thus the valid projection 

vectors are systemat ical ly derived from the DG In architecture 
specification step, processor basis A corresponding to each valid 
projection vector is determined as the graph-based projection 
procedure may be described in terms of algebraic 

transformations Then a geometric transformation matrix T *= ^t 
IS formed The design information such as the number of cells 
operations performed in each cell and data timings are 
extracted from the transformed index set of the DG Cell 
interconnect i ons and data movement are extracted from the 
transformed dependencies 

Since the state-of-art of current technology does not recommend 
three or higher dimensional architectures the algorithm for 
multiprojeotion scheme which maps four or higher dimensional DGs 
such as 2D oon\/olution onto a planar systolic array, is developed 
Further this method supports the scheme of mapping problem of larger 
size onto the resulting smaller architecture, which is not so with many 
of the existing methods 
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As the scope of further work is concerned the concepts of 
muliiprojeciion and pariitwmnff can be readily incorporated into the 
ADSA package More research should be oriented towards 
selecting a best algorithm for the problem under consideration 
as enumerated for FIR filter, DPT and recursive convolution 
Further the step of transforming the algorithm into locally 
recursive equations should be made automatic A new 
optimization theory is needed to be developed which not only 
considers the area time and 1/0 bandwidth measures (note that 
the architectures with non-slationary data needs more I/O 
bandwidth) but also the purpose for which the architecture will 


be used etc etc 


The non linear transformations such as 


ffll ‘ 

mod2 1 


of 


Quinton [93 which produces block cont^oli^er ( an architecture for 


linear convolution, whose throughput can made as large as 
possible, of course with corresponding increase mthe number of 
cells) and the cell sharing transformation of Cappello [53 are 
quite interesting A comparative study of the method presented 
in this thesis, Moldovan's method [123, Cappello's method [53 
and Quinton's method C93 is worth rendering 



APPENDIX 


A 1 PART 1 T 1 ON I NG 

Consider a FIR filter algorithm described by 

Um » S (A-i) 

1 •■0 

The DG which is obtained from the backward recurrences of 

eq (4-16) is shown in figure A-1 for M=7 and m=7 As already 

mentioned in section 425 there exists three valid projection 

vectors* namely (1 0)^ (0 1)^ and (1 1)^ for the schedule 

vector (1 1)^ The projection vector (0 1)^ has been considered 

for further discussion Thus, the transformation matrix 
fill 

T « p The resulting architecture which is obtained from 
■ - 

the transformed index set and dependencies is shown in 

figure A-2 Note that this architecture needs as many cells as 
the number of samples of the output sequence ym which is not 
feasible Thus, we need to map the computations of a larger 

size problem onto the smaller architecture and this is known as 
the partitioning' It is a basic requirement in many practical 

system designs as can be recalled by Heller s tripartite 

corollary [203 to Murphy's law 

No matter a'hat special purpose device is available there is a problem too 
large for it 

The problem will manifest itself only after the de\/ice is accfuired and can no 

longer be modified 

The problem can not be ignored 

lOl 
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To design the partitioning scheme, the following factors should 

be taken into account [101 

1 Minimum overall computation time 

2 Minimum control overheads 

3 Balanced tradeoff between external communication and 
local memory 

It IS necessary to specify at what time and in which processor the 
computation of each index point of the DC takes place For a 
systematic mapping from the DG onto a systolic array the DG is 

regularly partitioned into many blocks, each consisting of a 

cluster of nodes in the DG For convenience of presentation 
the following mathematical notations are adopted Suppose that 
an n-dimensi onal DG is projected to an (n-1 )-dimensional array 
of size LjLXL 2 X The array is partitioned into 

MjXM 2 X blocks, where each block is of size 

Zj|^XZ 2 X X2^„j^ Obviously, for i»l 2, n-1 / M^ 

There are two methods for mapping the partitioned DG to an 

array locally sequential globally parallel (LSGP) method and 
locally parallel globally sequential (LPGS) method 
1 LSGP scheme 

In the LSGP scheme, one block is mapped to one processor Thus 
the number of blocks is equal to the number of processors in 
the array, i e , the array size equals to the product 
MiXM 2 X Each processor sequentially executes the 

nodes (indices) of the corresponding block In order to store 
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the node data in the block local memory within each processor 
js needed, i e local memory size should be c'xZ^y7.2X X^n-i 
where o is a small constant (usually 1 2 or 3) depending on the 
data dependencies As long as the local memory is large enough 
for the computation under consideration, the LSGP approach is 
quite appealing 


2 LPGS scheme 

In the LPGS scheme the block size is chosen to match the array 
size, 1 e one block can be mapped to one array All nodes 
within one block are processed concurrently i e locally parallel 
One block after another block of node data are loaded into the 
array and processed in a sequential manner i e , globally 
sequential In this scheme, local memory size in the processor 
can be kept constant, independent of the size of computation 
All intermediate data can be stored in certain buffers outside 
the processor array Usually simple First-in-Fi rst out (FIFO) 
buffers are adequate for storing and recirculating the 
intermediate data efficiently tl83 


The LPGS scheme tl2J can be readily incorporated into the ADSA 
package and hence will be discussed in detail 


Suppose that the algorithm of the figure A-1 has 

a linear array of fixed size, say P-^ 

again which has 


transf ormat i on matrix T >* 


1 1 
1 0 


the DG under consideration This transformation 


to mapped onto 
Consider the 
been found for 
can be used for 
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partitioning the algorithm if we enhance its interpretation 
Consider that the last row (generally last n-i rows) of T is 
some partitioning hyperplane as defined by 
Ar « <1 

This hyperplane partitions the index space of the figure A-i 
into bands The timing hyperplanes sweep all the index points 
in each band before another band is processed The band 
boundary must satisfy the equation 

Ar*ij^, mod 4*0 (A-2) 

where is an index point of the DG 


Thus all index points inside one band are processed before 
another band is considered Inside each band the index points 
are swept by a family of parallel time hyperplanes s * (1 i.)^ 
Condition <A-2) assure that at any given moment no more than P 
index points are processed For each index point it is needed 
to determine to which band it belongs and in what processor it is mapped 
In order to distinguish between different bands, a coordinate b 
has been assigned to each band Thus an index point will be 
assigned to a band 



The allocation of computations to processors is done according 
to the formula 

A 

i(( * Ar*i(< mod 4 

For example the index point (5,4> will be allocated to 
processor (i) in band 
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Further the exact processing time for each index point 
depends upon the order in which the bands are executed Many 
execution orderings are possible 


In table A- 

-1 , the 

original index 

set for 

the 

al gor 1 thm 

of 

figure A-1 

and the 

allocation of 

indices 

to 

bands and 

to 


processors are shown Note that the execution of band Bg 
followed by the band is chosen for the algorithm under 

consideration Thus the time coordinate in table A-1 was 
determined by first executing all points inside one band 
according to the ordering imposed by s and then moving to the 
next band Based on the information furnished in the table A-1 
the architecture of figure A-3 is constructed as illustrated in 
chapters 3 and 4 Note that the dependencies are mapped by the 
transformation such that only local communications are required 
inside each band The communications between the computations 
in adjacent bands are performed via external FIFO queue 
registers 

Note that the architecture of figure A-2 needs eight cells and 
results in an execution lime of 22 clock cycles where as the 
architecture of figure A-3 needs only four cells and a FIFO of 
SIX memory locations and results in an execution time of 22 
clock cycles (excluding the time to load the variables 
X and o) Note also how the computations are indexed (starting 
from 0 rather than 1) in order to have minimum number of bands 
and hence minimum execution time [101 
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Table A-1 
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index 
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A 2 MULT I PROJECT I ON 


iOS 


In chapter 3 a methodology has been proposed which maps an 
n— d 1 mens 1 ona 1 DG onto an ( n—i )— dimensional systolic array In 
principle^ the method can be applied k times and thus reduces 
the dimension of the array to n-k This scheme is called the 
'multiprojecUon method 

The potential difficulties of this mapping are (1) the presence 
of delay edges in the (n-1)- dimensional array and (2) the 
possibilities of loops or cycles, although no loops or cycles 
can exist in a DG 

To handle the cycle problem, an instance praph (at certain lime 
tsstg) IS defined as the array with all self loops removed Hence 
the instance graph has no loops or cycles According to the 
schedule s, all the nodes in the instance graph are executed 
simultaneously at tx^tQ An aotii^ity instance (at t=tQ) can be 
defined as the nodes represented by the instance graph at ta=tQ 
Further according to the schedule s, there will be no overlap 
in time between two consecutive activity instances (one at tstg 
and the next instance at l*=to + D) 

Recall that a mapping methodology should consist of a schedule 
part and a projection part As to the schedule part it is 
always possible to find a valid schedule vector s for the 
instance graph, by using the eqs (3-5) and (3-6), since there 
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IS no loop OP oycls in ths instdncc 9 P 3 ph And & v&lid 
ppojsction vsctop p fop ths instsncs spaph can bs found 
accopding to the ppocedupe outlined in chaptep 3 

To ppoject an ( n-i )-dimensional instance gpaph to an 
(n-2 )-dimensional gpaph it is necessapy to create a new type of 
delay denoted by r Note that the global delay D and local 
delay r ape intimately related The relationship depends upon 
the constraints imposed by the processor ai'ailabihty To ensure 
processor availability an activity instance must have adequate 
time to be complete before the next activity instance starts 
So the following condition is necessary 

D i T + <M-l)*(s' p')*^ (A-3) 

where M is the maximal number of nodes along the p'^direction in 
the instance graph Note that this condition also guarantees 

that there is no time overlap between two activity instances 

The neu' transformation matrix T'^ •« [ formed where is the 

processor basis corresponding to the new projection vector p"^ 
Then nei*' transformed index set and dependencies are computed (consider 
the self loops also) by considering the index set (last n-1 rows 
of the old transformed index set) and dependencies (last n-1 
rows of the old transformed dependencies) of all instance graphs 

Further, the new transformed index set has to modified by 

adding (R^^-1 )f [1+(M-1 )*(s' p') 3 to the first row of the new 

transformed index set Thus the first row of the modified new 
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transformed index set specifies the timestep at which a computation 
will occur Since the original delay link d vith delay weight /3D map to a 
link bearing delay weight &d+(s'd)7 the new transformed dependencies are 
also modified by adding the first row elements of the old 
transformed dependencies multiplied by (D/t) to the corresponding 
first row elements of the new transformed dependencies Thus 
the first row of the modified new transformed dependencies specifies 
the amount of external delay on various links With this 
information, the architecture is constructed as explained in 
chapter 3 


For example consider the architecture of figure 4-4 which is 
obtained by projecting the 3-dimensional DG of matrix-matrix 
multiplication CssAIB algorithm in (1 0 0)^ direction Note the 
original index set of this algorithm which represents the 
various computations Further from the transformed index set 
it can be found that there exists seven activity instances 
viz at t»3,4, ,9 By considering the instance graph at t=6 

a valid schedule vector <1 0)^ is found which results in 

minimum computation time Further there exists three valid 
projection vectors namely <1 0)^, <1 1)*' and <1 -1)^ Consider 
the projection vector (1 1)^ for further discussion 


Then from eg 
transformation 


<A-3), D2:3t because M=3 Let D=3 t The geometric 

matrix, corresponding to this projection vector 
The new transformed index set and dependencies are given 


here 


L 
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1 23434545623434 45^2 

12512312^1231231231 

[ooi] 


2 3 4 3 4 5 4 5 6 


2 3 1 2 3 1 2 3 


The modified new trensformed index set end dependencies ere printed below 


Tf * L 10 b 10 14 10 14 16 5 5 1j 9 13 17 13 17 21 6 12 16 12 l£ 20 if 20 24 
|J2312'11 31231 312’'12j123S125 

f r4 3 4 ' 

From /id it can be noted that the variables a and b stay 
within the cells and they need an external delay of three and 
two clock cycles respectively This means that a particular 
element of the matrix A will be used again and again after 
every three clock cycles and that of matrix B will be used 
after every two clock cycles Following the guidelines provided 
in chapter 3 for constructing the systolic array from the 
transformed index set and dependencies, the architecture of 
figure A-4 is obtained from /ll^ and tIid 

Note that the architecture of figure 4-4 needs nine cells and 
the total required execution time of seven clock cycles 
(excluding the time needed to load the elements of matrix B) 
In the architecture of figure A-4 there are only three cells 
but the total required execution time is 23 clock cycles 
(excluding the time needed to load the elements of both 
matrices A and B) Further the cells of the architecture of 
figure A-4 needs some IocmI memory to store the elements of 


matrices A and B 
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